Indexed Searching on Proteins Using a Suffix Sequoia
Abstract
Approximate searching on protein sequence data under arbitrary cost models is not supported by database indexing technology. We present a new data structure, the suffix sequoia, which reduces the time complexity of the dynamic programming (DP) matrix calculation required in approximate matching. The data structure is compact: it uses just over 4 bytes per symbol indexed. We show that the time complexity of the DP calculation is O(qg^d) for a pattern of length q, alphabet size g, and indexing window size d. The DP calculation requires no disk access and can be executed efficiently. The second phase of the algorithm is based on sequential disk access and appears to be effective. Approximate matching experiments are promising and offer a lot of scope for algorithm refinement and data structure engineering.
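The way an index over fixed-size windows can share DP work may be illustrated with a minimal sketch. This is not the suffix sequoia itself: it assumes a plain in-memory trie over all length-d windows of the text, and the names `Node`, `build_window_trie`, `search`, and `unit_cost` are hypothetical. One DP column of q + 1 cells is computed per trie node, and a trie of depth d over an alphabet of size g has at most on the order of g^d nodes, so windows with a common prefix reuse the same columns.

```python
class Node:
    """Trie node: child links plus start positions of windows ending here."""
    def __init__(self):
        self.children = {}
        self.positions = []

def build_window_trie(text, d):
    """Insert every length-d window of text into a trie (illustrative index)."""
    root = Node()
    for start in range(len(text) - d + 1):
        node = root
        for symbol in text[start:start + d]:
            node = node.children.setdefault(symbol, Node())
        node.positions.append(start)
    return root

def search(root, pattern, max_cost, sub_cost, gap_cost=1):
    """Align pattern globally against every indexed window.

    Returns ({window_start: cost}, number_of_DP_columns_computed).
    Assumes non-negative costs, so a column whose minimum already
    exceeds max_cost can never lead to a hit and is pruned.
    """
    q = len(pattern)
    hits = {}
    columns = 0
    first = [i * gap_cost for i in range(q + 1)]  # column for the empty prefix
    stack = [(root, first)]
    while stack:
        node, prev = stack.pop()
        if node.positions and prev[q] <= max_cost:
            for start in node.positions:
                hits[start] = prev[q]
        for symbol, child in node.children.items():
            col = [prev[0] + gap_cost]
            for i in range(1, q + 1):
                col.append(min(
                    prev[i - 1] + sub_cost(pattern[i - 1], symbol),  # substitute
                    prev[i] + gap_cost,                              # gap in pattern
                    col[i - 1] + gap_cost,                           # gap in window
                ))
            columns += 1
            if min(col) <= max_cost:
                stack.append((child, col))
    return hits, columns

def unit_cost(a, b):
    """Placeholder cost model; a substitution matrix could be plugged in here."""
    return 0 if a == b else 1

trie = build_window_trie("ACDEACDFACDE", 4)
hits, columns = search(trie, "ACDE", 1, unit_cost)
print(hits, columns)  # hits == {0: 0, 4: 1, 8: 0}
```

Computing each window separately would take 9 windows times 4 columns, that is 36 columns; the trie traversal computes fewer because shared prefixes are evaluated once and hopeless branches are pruned.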
Similar resources
PSISA: An Algorithm for Indexing and Searching Protein Structure using Suffix Arrays
Protein Structure Indexing using Suffix Array (PSISA) is a new technique that retrieves similar proteins based on their structures. Indexing protein structures is one approach to searching for protein similarities. In this paper we develop the proposed technique based on a novel use of suffix arrays. We start by converting protein structure into a sequence by ...
The Suffix Sequoia
Standard technologies for sequence searching do not use database indexes. These solutions can be divided into exhaustive algorithms, e.g. the Smith-Waterman algorithm [11], and heuristic ones, like BLAST [1, 2], FASTA [10], and BLAT [7]. Specialised tools for DNA matching exist, such as SIM4 [3] and SSAHA [9]. Only BLAT and SSAHA use indexing. BLAT can be used with proteins; however, its sensit...
Protein Structure Searching using Suffix Arrays
Searching for protein similarities using structure-based queries plays a vital role in many applications, such as drug discovery and design, disease diagnosis and treatment, and protein classification. Indexing protein structures is one approach to searching them for similarities. In this paper we propose a method to improve the memory space used for storing the indexed data witho...
String matching with alphabet sampling
We introduce a novel alphabet sampling technique for speeding up both online and indexed string matching. We choose a subset of the alphabet and extract the corresponding subsequence of the text. Online or indexed searching is then carried out on the extracted subsequence, and candidate matches are verified in the full text. We show that this speeds up online searching, especially for moderate ...
PSIST: A Scalable Approach to Indexing
Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we develop a new method for extracting local structural (or geometric) features from protein structures. These feature vectors are in t...
Journal: IEEE Data Eng. Bull.
Volume: 27
Publication date: 2004